Off-Policy Actor-Critic 



Thomas Degris thomas.degris@inria.fr 
Flowers Team, INRIA, Talence, ENSTA-ParisTech, Paris, France 

Martha White whitem@cs.ualberta.ca 

Richard S. Sutton sutton@cs.ualberta.ca 

RLAI Laboratory, Department of Computing Science, University of Alberta, Edmonton, Canada 



Abstract 

This paper presents the first actor-critic al- 
gorithm for off-policy reinforcement learning. 
Our algorithm is online and incremental, and 
its per-time-step complexity scales linearly 
with the number of learned weights. Pre- 
vious work on actor-critic algorithms is lim- 
ited to the on-policy setting and does not 
take advantage of the recent advances in off- 
policy gradient temporal-difference learning. 
Off-policy techniques, such as Greedy-GQ,
enable a target policy to be learned while 
following and obtaining data from another 
(behavior) policy. For many problems, how- 
ever, actor-critic methods are more practical 
than action value methods (like Greedy-GQ) 
because they explicitly represent the policy; 
consequently, the policy can be stochastic 
and utilize a large action space. In this pa- 
per, we illustrate how to practically combine 
the generality and learning potential of off- 
policy learning with the flexibility in action 
selection given by actor-critic methods. We 
derive an incremental, linear time and space 
complexity algorithm that includes eligibility 
traces, prove convergence under assumptions 
similar to previous off-policy algorithms, and 
empirically show better or comparable per- 
formance to existing algorithms on standard 
reinforcement-learning benchmark problems. 

The reinforcement learning framework is a general 
temporal learning formalism that has, over the last 
few decades, seen a marked growth in algorithms and 
applications. Until recently, however, practical online 
methods with convergence guarantees have been restricted
to the on-policy setting, in which the agent
learns only about the policy it is executing. 

In an off-policy setting, on the other hand, an agent 
learns about a policy or policies different from the one 
it is executing. Off-policy methods have a wider range 
of applications and learning possibilities. Unlike on- 
policy methods, off-policy methods are able to, for ex- 
ample, learn about an optimal policy while executing 
an exploratory policy (Sutton & Barto, 1998), learn 
from demonstration (Smart & Kaelbling, 2002), and
learn multiple tasks in parallel from a single sensori- 
motor interaction with an environment (Sutton et al., 
2011). Because of this generality, off-policy methods 
are of great interest in many application domains. 

The best-known off-policy method is Q-learning
(Watkins & Dayan, 1992). However, while Q-Learning 
is guaranteed to converge to the optimal policy for the 
tabular (non-approximate) case, it may diverge when 
using linear function approximation (Baird, 1995). 
Least-squares methods such as LSTD (Bradtke & 
Barto, 1996) and LSPI (Lagoudakis & Parr, 2003) can 
be used off-policy and are sound with linear function 
approximation, but are computationally expensive; 
their complexity scales quadratically with the num- 
ber of features and weights. Recently, these problems 
have been addressed by the new family of gradient- 
TD (Temporal Difference) methods (e.g., Sutton et 
al., 2009), such as Greedy-GQ (Maei et al., 2010), 
which are of linear complexity and convergent under 
off-policy training with function approximation. 

All action-value methods, including gradient-TD 
methods such as Greedy-GQ, suffer from three impor- 
tant limitations. First, their target policies are deter- 
ministic, whereas many problems have stochastic op- 
timal policies, such as in adversarial settings or in par- 
tially observable Markov decision processes. Second, 
finding the greedy action with respect to the action-value
function becomes problematic for larger action
spaces. Finally, a small change in the action-value 
function can cause large changes in the policy, which 
creates difficulties for convergence proofs and for some 
real-time applications. 

The standard way of avoiding the limitations of action- 
value methods is to use policy-gradient algorithms 
(Sutton et al., 2000) such as actor-critic methods 
(e.g., Bhatnagar et al., 2009). For example, the nat- 
ural actor-critic, an on-policy policy-gradient algo- 
rithm, has been successful for learning in continuous 
action spaces in several robotics applications (Peters 
& Schaal, 2008). 

The first and main contribution of this paper is to 
introduce the first actor-critic method that can be ap- 
plied off-policy, which we call Off-PAC, for Off-Policy 
Actor-Critic. Off-PAC has two learners: the actor and 
the critic. The actor updates the policy weights. The 
critic learns an off-policy estimate of the value func- 
tion for the current actor policy, different from the 
(fixed) behavior policy. This estimate is then used 
by the actor to update the policy. For the critic, in 
this paper we consider a version of Off-PAC that uses 
GTD($\lambda$) (Maei, 2011), a gradient-TD method with eligibility
traces for learning state-value functions. We
define a new objective for our policy weights and derive 
a valid backward-view update using eligibility traces. 
The time and space complexity of Off-PAC is linear in 
the number of learned weights. 

The second contribution of this paper is an off-policy 
policy-gradient theorem and a convergence proof for 
Off-PAC when $\lambda = 0$, under assumptions similar to
previous off-policy gradient-TD proofs. 

Our third contribution is an empirical comparison of 
Q($\lambda$), Greedy-GQ, Off-PAC, and a soft-max version of
Greedy-GQ that we call Softmax-GQ, on three bench- 
mark problems in an off-policy setting. To the best 
of our knowledge, this paper is the first to provide 
an empirical evaluation of gradient-TD methods for 
off-policy control (the closest known prior work is the 
work of Delp (2011)). We show that Off-PAC outper- 
forms other algorithms on these problems. 

1. Notation and Problem Setting 

In this paper, we consider Markov decision processes with a discrete state space $\mathcal{S}$, a discrete action space $\mathcal{A}$, a distribution $P : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$, where $P(s'|s, a)$ is the probability of transitioning into state $s'$ from state $s$ after taking action $a$, and an expected reward function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ that provides an expected reward for taking action $a$ in state $s$ and transitioning into $s'$. We observe a stream of data, which includes states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, and rewards $r_t \in \mathbb{R}$ for $t = 1, 2, \ldots$, with actions selected from a fixed behavior policy, $b(a|s) \in (0, 1]$.

Given a termination condition $\gamma : \mathcal{S} \to [0, 1]$ (Sutton et al., 2011), we define the value function for $\pi : \mathcal{S} \times \mathcal{A} \to (0, 1]$ to be:

$$V^{\pi,\gamma}(s) = \mathrm{E}\left[r_{t+1} + \ldots + r_{t+T} \mid s_t = s\right] \quad \forall s \in \mathcal{S} \qquad (1)$$

where policy $\pi$ is followed from time step $t$ and terminates at time $t + T$ according to $\gamma$. We assume termination always occurs in a finite number of steps.

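The action-value function, $Q^{\pi,\gamma}(s, a)$, is defined as:

$$Q^{\pi,\gamma}(s, a) = \sum_{s' \in \mathcal{S}} P(s'|s, a)\left[\mathcal{R}(s, a, s') + \gamma(s') V^{\pi,\gamma}(s')\right] \qquad (2)$$

for all $a \in \mathcal{A}$ and for all $s \in \mathcal{S}$. Note that $V^{\pi,\gamma}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) Q^{\pi,\gamma}(s, a)$, for all $s \in \mathcal{S}$.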

The policy $\pi_u : \mathcal{A} \times \mathcal{S} \to [0, 1]$ is an arbitrary, differentiable function of a weight vector, $u \in \mathbb{R}^{N_u}$, $N_u \in \mathbb{N}$, with $\pi_u(a|s) > 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$. Our aim is to choose $u$ so as to maximize the following scalar objective function:

$$J_\gamma(u) = \sum_{s \in \mathcal{S}} d^b(s) V^{\pi_u,\gamma}(s) \qquad (3)$$

where $d^b(s) = \lim_{t \to \infty} P(s_t = s \mid s_0, b)$ is the limiting distribution of states under $b$ and $P(s_t = s \mid s_0, b)$ is the probability that $s_t = s$ when starting in $s_0$ and executing $b$. The objective function is weighted by $d^b$ because, in the off-policy setting, data is obtained according to this behavior distribution. For simplicity of notation, we will write $\pi$ and implicitly mean $\pi_u$.
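For a small tabular problem, Equations (1) to (3) can be evaluated exactly, which is a convenient sanity check for any learned estimate. The following is a minimal sketch, not part of the paper: the array layout (`P[s, a, s2]`, `R[s, a, s2]`, `gamma[s2]`), the function names, and the use of power iteration to approximate $d^b$ are all assumptions made for illustration.

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Return (V, Q) for policy pi on a tabular MDP (assumed array layout).

    P[s, a, s2] : transition probability P(s2 | s, a)
    R[s, a, s2] : expected reward for that transition
    gamma[s2]   : termination function gamma(s2) in [0, 1]
    pi[s, a]    : target policy probabilities pi(a | s)
    """
    n_states = P.shape[0]
    # State-to-state transition matrix and expected one-step reward under pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)
    # Solve V = r_pi + P_pi diag(gamma) V, the fixed point behind Equation (1).
    V = np.linalg.solve(np.eye(n_states) - P_pi * gamma[None, :], r_pi)
    # Equation (2): Q(s,a) = sum_s2 P(s2|s,a) [R(s,a,s2) + gamma(s2) V(s2)].
    Q = np.einsum("sat,sat->sa", P, R + gamma[None, None, :] * V[None, None, :])
    return V, Q

def limiting_distribution(P, b, n_iter=2000):
    """Approximate d^b by power iteration on the behavior-policy chain."""
    P_b = np.einsum("sa,sat->st", b, P)
    d = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        d = d @ P_b
    return d / d.sum()

def objective(P, R, gamma, pi, b):
    """Equation (3): J = sum_s d^b(s) V(s)."""
    V, _ = evaluate_policy(P, R, gamma, pi)
    return float(limiting_distribution(P, b) @ V)
```

The linear solve requires that termination eventually occurs, which matches the assumption stated after Equation (1).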

2. The Off-PAC Algorithm 

In this section, we present the Off-PAC algorithm in 
three steps. First, we explain the basic theoretical 
ideas underlying the gradient-TD methods used in the 
critic. Second, we present our off-policy version of the 
policy-gradient theorem. Finally, we derive the for- 
ward view of the actor and convert it to a backward 
view to produce a complete mechanistic algorithm us- 
ing eligibility traces. 

2.1. The Critic: Policy Evaluation 

Evaluating a policy $\pi$ consists of learning its value function, $V^{\pi,\gamma}(s)$, as defined in Equation 1. Since it is often impractical to explicitly represent every state $s$, we learn a linear approximation of $V^{\pi,\gamma}(s)$: $\hat{V}(s) = v^\top x_s$, where $x_s \in \mathbb{R}^{N_v}$, $N_v \in \mathbb{N}$, is the feature vector of the state $s$, and $v \in \mathbb{R}^{N_v}$ is another weight vector.



Gradient-TD methods (Sutton et al., 2009) incrementally learn the weights, $v$, in an off-policy setting, with a guarantee of stability and a linear per-time-step complexity. These methods minimize the $\lambda$-weighted mean-squared projected Bellman error:

$$\mathrm{MSPBE}(v) = \left\| \hat{V} - \Pi T_\pi^{\lambda,\gamma} \hat{V} \right\|_D^2$$

where $\hat{V} = Xv$; $X$ is the matrix whose rows are all $x_s$; $\lambda$ is the decay of the eligibility trace; $D$ is a matrix with $d^b(s)$ on its diagonal; $\Pi$ is a projection operator that projects a value function to the nearest representable value function given the function approximator; and $T_\pi^{\lambda,\gamma}$ is the $\lambda$-weighted Bellman operator for the target policy $\pi$ with termination probability $\gamma$ (e.g., see Maei & Sutton, 2010). For a linear representation, $\Pi = X(X^\top D X)^{-1} X^\top D$.
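The projection used in the MSPBE can be formed explicitly when the state space is small. The sketch below is only illustrative: the feature matrix `X` and weighting `d` in any usage are made-up numbers, and the function names are not from the paper. It builds $\Pi = X(X^\top D X)^{-1} X^\top D$ and the $D$-weighted squared norm that appears in the MSPBE.

```python
import numpy as np

def projection_matrix(X, d):
    """Pi = X (X^T D X)^{-1} X^T D, with D = diag(d) and d the state weighting."""
    D = np.diag(d)
    # Solving the linear system avoids forming the explicit inverse.
    return X @ np.linalg.solve(X.T @ D @ X, X.T @ D)

def weighted_sq_norm(y, d):
    """||y||_D^2 = sum_s d(s) * y(s)^2."""
    return float(np.sum(d * y ** 2))
```

Two properties are worth checking on a toy problem: $\Pi$ is idempotent, and $\Pi y$ is at least as close to $y$ in the $D$-weighted norm as any other representable vector $Xv$.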

In this paper, we consider the version of Off-PAC that 
updates its critic weights by the GTD($\lambda$) algorithm
introduced by Maei (2011). 

2.2. Off-policy Policy-gradient Theorem 

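Like other policy gradient algorithms, Off-PAC updates the weights approximately in proportion to the gradient of the objective:

$$u_{t+1} - u_t \approx \alpha_{u,t} \nabla_u J_\gamma(u_t) \qquad (4)$$

where $\alpha_{u,t} \in \mathbb{R}$ is a positive step-size parameter. Starting from Equation 3, the gradient can be written:

$$\begin{aligned}
\nabla_u J_\gamma(u) &= \nabla_u \left[ \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \pi(a|s) Q^{\pi,\gamma}(s, a) \right] \\
&= \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \left[ \nabla_u \pi(a|s) Q^{\pi,\gamma}(s, a) + \pi(a|s) \nabla_u Q^{\pi,\gamma}(s, a) \right]
\end{aligned}$$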

The final term in this equation, $\nabla_u Q^{\pi,\gamma}(s, a)$, is difficult to estimate in an incremental off-policy setting. The first approximation involved in the theory of Off-PAC is to omit this term. That is, we work with an approximation to the gradient, which we denote $g(u) \in \mathbb{R}^{N_u}$, defined by

$$\nabla_u J_\gamma(u) \approx g(u) = \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \nabla_u \pi(a|s) Q^{\pi,\gamma}(s, a) \qquad (5)$$

The two theorems below provide justification for this 
approximation. 



Theorem 1 (Policy Improvement). Given any policy parameter $u$, let

$$u' = u + \alpha g(u)$$

Then there exists an $\epsilon > 0$ such that, for all positive $\alpha < \epsilon$,

$$J_\gamma(u') \geq J_\gamma(u)$$

Further, if $\pi$ has a tabular representation (i.e., separate weights for each state), then $V^{\pi_{u'},\gamma}(s) \geq V^{\pi_u,\gamma}(s)$ for all $s \in \mathcal{S}$.

(Proof in Appendix). 
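The policy improvement claim can be checked numerically on a small problem. The sketch below is an illustration under several assumptions that are not in the paper: a tabular Gibbs (softmax) policy with one weight per state-action pair, a constant termination value `gamma`, and a given positive weighting `d_b` standing in for $d^b$. It computes the approximate gradient $g(u)$ of Equation (5) in closed form and the objective $J_\gamma$.

```python
import numpy as np

def softmax_policy(u):
    """Tabular Gibbs policy; u[s, a] is a separate weight per state-action pair."""
    z = np.exp(u - u.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def values(P, R, gamma, pi):
    """Exact V and Q for pi, with a constant termination value gamma (assumption)."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sat,sat->s", pi, P, R)
    V = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)
    Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return V, Q

def approx_gradient(u, P, R, gamma, d_b):
    """g(u) of Equation (5) for the tabular softmax policy.

    For softmax, d pi(a|s) / d u[s, c] = pi(a|s) (1[a == c] - pi(c|s)), so
    sum_a (d pi(a|s) / d u[s, c]) Q(s, a) = pi(c|s) (Q(s, c) - V(s)).
    """
    pi = softmax_policy(u)
    V, Q = values(P, R, gamma, pi)
    return d_b[:, None] * pi * (Q - V[:, None])

def objective(u, P, R, gamma, d_b):
    """J(u) = sum_s d_b(s) V(s) for the policy defined by u."""
    V, _ = values(P, R, gamma, softmax_policy(u))
    return float(d_b @ V)
```

Taking `u_new = u + alpha * approx_gradient(...)` with a small `alpha` and comparing `objective` and `values` before and after gives a direct, if informal, check of both parts of the theorem in the tabular case.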

In the conventional on-policy theory of policy-gradient 
methods, the policy-gradient theorem (Marbach & 
Tsitsiklis, 1998; Sutton et al., 2000) establishes the re- 
lationship between the gradient of the objective func- 
tion and the expected action values. In our notation, 
that theorem essentially says that our approximation 
is exact, that $g(u) = \nabla_u J_\gamma(u)$. Although we cannot
show this in the off-policy case, we can establish a relationship
between the solutions found using the true
and approximate gradient: 

Theorem 2 (Off-Policy Policy-Gradient Theorem). Given $\mathcal{U} \subset \mathbb{R}^{N_u}$ a non-empty, compact set, let

$$\tilde{Z} = \{u \in \mathcal{U} \mid g(u) = 0\}$$
$$Z = \{u \in \mathcal{U} \mid \nabla_u J_\gamma(u) = 0\}$$

where $Z$ is the true set of local maxima and $\tilde{Z}$ the set of local maxima obtained from using the approximate gradient, $g(u)$. If the value function can be represented by our function class, then $Z \subset \tilde{Z}$. Moreover, if we use a tabular representation for $\pi$, then $Z = \tilde{Z}$.

(Proof in Appendix). 

The proof of Theorem 2, showing that $Z = \tilde{Z}$, requires tabular $\pi$ to avoid update overlap: updates to a single parameter influence the action probabilities for only one state. Consequently, both parts of the gradient (one part with the gradient of the policy function and the other with the gradient of the action-value function) locally greedily change the action probabilities for only that one state. Extrapolating from this result, in practice, more generally a local representation for $\pi$ will likely suffice, where parameter updates influence only a small number of states. Similarly, in the non-tabular case, the claim will likely hold if $\gamma$ is small (the return is myopic), again because changes to the policy mostly affect the action-value function locally.

Fortunately, from an optimization perspective, for all $u \in \tilde{Z} \setminus Z$, $J_\gamma(u) < \min_{u' \in Z} J_\gamma(u')$; in other words, $Z$ represents all the largest local maxima in $\tilde{Z}$ with respect to the objective, $J_\gamma$. Local optimization techniques, like random restarts, should help ensure that we converge to larger maxima and so to $u \in Z$. Even with the true gradient, these approaches would be incorporated into learning because our objective, $J_\gamma$, is non-convex.

2.3. The Actor: Incremental Update 
Algorithm with Eligibility Traces 

We now derive an incremental update algorithm using observations sampled from the behavior policy. First, we rewrite Equation 5 as an expectation:

$$\begin{aligned}
g(u) &= \mathrm{E}\left[ \sum_{a \in \mathcal{A}} \nabla_u \pi(a|s) Q^{\pi,\gamma}(s, a) \,\middle|\, s \sim d^b \right] \\
&= \mathrm{E}\left[ \sum_{a \in \mathcal{A}} b(a|s) \frac{\pi(a|s)}{b(a|s)} \frac{\nabla_u \pi(a|s)}{\pi(a|s)} Q^{\pi,\gamma}(s, a) \,\middle|\, s \sim d^b \right] \\
&= \mathrm{E}\left[ \rho(s, a) \psi(s, a) Q^{\pi,\gamma}(s, a) \,\middle|\, s \sim d^b, a \sim b(\cdot|s) \right] \\
&= \mathrm{E}_b\left[ \rho(s_t, a_t) \psi(s_t, a_t) Q^{\pi,\gamma}(s_t, a_t) \right]
\end{aligned}$$

where $\rho(s, a) = \frac{\pi(a|s)}{b(a|s)}$, $\psi(s, a) = \frac{\nabla_u \pi(a|s)}{\pi(a|s)}$, and we introduce the new notation $\mathrm{E}_b[\cdot]$ to denote the expectation implicitly conditional on all the random variables (indexed by time step) being drawn from their limiting stationary distribution under the behavior policy. A standard result (e.g., see Sutton et al., 2000) is that an arbitrary function of state can be introduced into these equations as a baseline without changing the expected value. We use the approximate state-value function provided by the critic, $\hat{V}$, in this way:

$$g(u) = \mathrm{E}_b\left[ \rho(s_t, a_t) \psi(s_t, a_t) \left( Q^{\pi,\gamma}(s_t, a_t) - \hat{V}(s_t) \right) \right]$$
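The two facts used in this derivation, that weighting by $\rho\psi$ under the behavior policy recovers the sum over $\nabla_u \pi \, Q$, and that a state-dependent baseline leaves the expectation unchanged, can be verified for a single state. This is a small sketch with hypothetical helper names; the only structural assumption is that the rows of `grad_pi` sum to zero across actions, which holds for any policy because the probabilities sum to one.

```python
import numpy as np

def exact_sum(grad_pi, Q):
    """sum_a grad_u pi(a|s) Q(s, a) for one state; grad_pi has shape (actions, weights)."""
    return grad_pi.T @ Q

def behavior_expectation(pi, b, grad_pi, Q, baseline=0.0):
    """E_{a ~ b}[ rho(s,a) psi(s,a) (Q(s,a) - baseline) ] for one state."""
    rho = pi / b                      # importance-sampling ratio pi(a|s) / b(a|s)
    psi = grad_pi / pi[:, None]       # grad_u pi(a|s) / pi(a|s)
    weighted = b[:, None] * rho[:, None] * psi * (Q - baseline)[:, None]
    return weighted.sum(axis=0)
```

Because $b(a|s)$ cancels against the denominator of $\rho$, the result does not depend on which behavior policy is used, as long as $b(a|s) > 0$ for every action.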

The next step is to replace the action value, $Q^{\pi,\gamma}(s_t, a_t)$, by the off-policy $\lambda$-return. Because these are not exactly equal, this step introduces a further approximation:

$$g(u) \approx \hat{g}(u) = \mathrm{E}_b\left[ \rho(s_t, a_t) \psi(s_t, a_t) \left( R_t^\lambda - \hat{V}(s_t) \right) \right]$$

where the off-policy $\lambda$-return is defined by:

$$R_t^\lambda = r_{t+1} + (1 - \lambda) \gamma(s_{t+1}) \hat{V}(s_{t+1}) + \lambda \gamma(s_{t+1}) \rho(s_{t+1}, a_{t+1}) R_{t+1}^\lambda$$
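Since the $\lambda$-return is defined recursively in terms of the next return, it is easiest to compute by sweeping backward over a recorded trajectory. The following sketch assumes a finite trajectory stored as parallel arrays; the argument names and the `final_return` parameter (the value used for the return beyond the last recorded step) are illustrative assumptions.

```python
import numpy as np

def off_policy_lambda_returns(rewards, v_next, gamma_next, rho_next, lam, final_return=0.0):
    """Backward recursion for the off-policy lambda-return.

    rewards[t]    : r_{t+1}
    v_next[t]     : critic estimate V(s_{t+1})
    gamma_next[t] : termination value gamma(s_{t+1})
    rho_next[t]   : importance ratio rho(s_{t+1}, a_{t+1})
    final_return  : return assumed after the last recorded step (assumption)
    """
    T = len(rewards)
    returns = np.zeros(T)
    next_return = final_return
    for t in reversed(range(T)):
        returns[t] = (rewards[t]
                      + (1.0 - lam) * gamma_next[t] * v_next[t]
                      + lam * gamma_next[t] * rho_next[t] * next_return)
        next_return = returns[t]
    return returns
```

With `lam = 0` this reduces to the one-step target $r_{t+1} + \gamma(s_{t+1})\hat{V}(s_{t+1})$; with `lam = 1` and all ratios equal to one it is the discounted sum of rewards up to termination.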

Finally, based on this equation, we can write the forward view of Off-PAC:

$$u_{t+1} - u_t = \alpha_{u,t} \rho(s_t, a_t) \psi(s_t, a_t) \left( R_t^\lambda - \hat{V}(s_t) \right)$$



Algorithm 1 The Off-PAC algorithm

Initialize the vectors $e_v$, $e_u$, and $w$ to zero
Initialize the vectors $v$ and $u$ arbitrarily
Initialize the state $s$
For each step:
    Choose an action, $a$, according to $b(\cdot|s)$
    Observe resultant reward, $r$, and next state, $s'$
    $\delta \leftarrow r + \gamma(s') v^\top x_{s'} - v^\top x_s$
    $\rho \leftarrow \pi_u(a|s) / b(a|s)$
    Update the critic (GTD($\lambda$) algorithm):
        $e_v \leftarrow \rho \left( x_s + \gamma(s) \lambda e_v \right)$
        $v \leftarrow v + \alpha_v \left[ \delta e_v - \gamma(s')(1 - \lambda)(w^\top e_v) x_s \right]$
        $w \leftarrow w + \alpha_w \left[ \delta e_v - (w^\top x_s) x_s \right]$
    Update the actor:
        $e_u \leftarrow \rho \left[ \frac{\nabla_u \pi_u(a|s)}{\pi_u(a|s)} + \gamma(s) \lambda e_u \right]$
        $u \leftarrow u + \alpha_u \delta e_u$
    $s \leftarrow s'$

The forward view is useful for understanding and analyzing algorithms, but for a mechanistic implementation it must be converted to a backward view that does not involve the $\lambda$-return. The key step, proved in the appendix, is the observation that



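$$\mathrm{E}_b\left[ \rho(s_t, a_t) \psi(s_t, a_t) \left( R_t^\lambda - \hat{V}(s_t) \right) \right] = \mathrm{E}_b\left[ \delta_t e_t \right] \qquad (6)$$

where $\delta_t = r_{t+1} + \gamma(s_{t+1}) \hat{V}(s_{t+1}) - \hat{V}(s_t)$ is the conventional temporal difference error, and $e_t \in \mathbb{R}^{N_u}$ is the eligibility trace of $\psi$, updated by:

$$e_t = \rho(s_t, a_t) \left( \psi(s_t, a_t) + \lambda e_{t-1} \right)$$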



Finally, combining the three previous equations, the backward view of the actor update can be written simply as:

$$u_{t+1} - u_t = \alpha_{u,t} \delta_t e_t$$

The complete Off-PAC algorithm is given above as Algorithm 1. Note that although the algorithm is written in terms of states $s$ and $s'$, it really only ever needs access to the corresponding feature vectors, $x_s$ and $x_{s'}$, and to the behavior policy probabilities, $b(\cdot|s)$, for the current state. All of these are typically available in large-scale applications with function approximation. Also note that Off-PAC is fully incremental and has per-time-step computation and memory complexity that is linear in the number of weights, $N_u + N_v$.
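One step of Algorithm 1 translates fairly directly into array code. The class below is a sketch rather than the authors' implementation, and it rests on assumptions that should be checked against the original: the policy is a Gibbs distribution over linear preferences, the actor trace is taken to decay by $\gamma(s)\lambda$ in the same way as the critic trace, and the class and argument names are hypothetical. All operations are vector operations on the weight and trace vectors, in line with the linear per-step complexity noted above.

```python
import numpy as np

class OffPAC:
    """Sketch of one Off-PAC update step with a Gibbs policy (illustrative only)."""

    def __init__(self, n_critic, n_actor, alpha_v, alpha_w, alpha_u, lam):
        self.v = np.zeros(n_critic)    # critic weights
        self.w = np.zeros(n_critic)    # secondary GTD weights
        self.e_v = np.zeros(n_critic)  # critic eligibility trace
        self.u = np.zeros(n_actor)     # actor (policy) weights
        self.e_u = np.zeros(n_actor)   # actor eligibility trace
        self.alpha_v, self.alpha_w, self.alpha_u = alpha_v, alpha_w, alpha_u
        self.lam = lam

    def policy(self, phi):
        """Gibbs policy; phi[a] is the state-action feature vector of action a."""
        prefs = phi @ self.u
        z = np.exp(prefs - prefs.max())
        return z / z.sum()

    def step(self, x, phi, a, b_prob, r, x_next, gamma, gamma_next):
        """x, x_next: critic features of s and s'; b_prob: b(a|s);
        gamma, gamma_next: gamma(s) and gamma(s')."""
        pi = self.policy(phi)
        delta = r + gamma_next * (self.v @ x_next) - self.v @ x
        rho = pi[a] / b_prob

        # Critic: GTD(lambda), following the listing in Algorithm 1.
        self.e_v = rho * (x + gamma * self.lam * self.e_v)
        w_dot_e = self.w @ self.e_v
        w_dot_x = self.w @ x
        # The correction term uses x here as in the listing; GTD(lambda) is often
        # written with the next-state features instead, so check against Maei (2011).
        self.v = self.v + self.alpha_v * (
            delta * self.e_v - gamma_next * (1.0 - self.lam) * w_dot_e * x)
        self.w = self.w + self.alpha_w * (delta * self.e_v - w_dot_x * x)

        # Actor: psi = grad log pi for the Gibbs policy; trace decay is an assumption.
        psi = phi[a] - pi @ phi
        self.e_u = rho * (psi + gamma * self.lam * self.e_u)
        self.u = self.u + self.alpha_u * delta * self.e_u
        return delta
```

A caller would loop over transitions generated by the behavior policy, passing the probability `b_prob` that the behavior policy assigned to the action actually taken.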

With discrete actions, a common policy distribution is the Gibbs distribution, which uses a linear combination of features: $\pi(a|s) = \frac{e^{u^\top \phi_{s,a}}}{\sum_b e^{u^\top \phi_{s,b}}}$, where $\phi_{s,a}$ are state-action features for state $s$, action $a$, and where $\psi(s, a) = \frac{\nabla_u \pi(a|s)}{\pi(a|s)} = \phi_{s,a} - \sum_b \pi(b|s) \phi_{s,b}$. The state-action features, $\phi_{s,a}$, are potentially unrelated to the feature vectors $x_s$ used in the critic.
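The closed form for $\psi(s, a)$ under a Gibbs policy is easy to get wrong by a sign or a missing average, so a finite-difference check is a cheap safeguard. The sketch below uses hypothetical function names and treats `phi` as a matrix with one row of state-action features per action; it compares the analytic expression with a numerical gradient of $\log \pi(a|s)$.

```python
import numpy as np

def gibbs_policy(u, phi):
    """pi(a|s) proportional to exp(u . phi[a]); phi has one row per action."""
    prefs = phi @ u
    z = np.exp(prefs - prefs.max())  # subtract the max for numerical stability
    return z / z.sum()

def log_policy_gradient(u, phi, a):
    """psi(s, a) = phi[a] - sum_b pi(b|s) phi[b]."""
    pi = gibbs_policy(u, phi)
    return phi[a] - pi @ phi

def finite_difference_gradient(u, phi, a, eps=1e-6):
    """Central-difference estimate of the gradient of log pi(a|s) w.r.t. u."""
    grad = np.zeros_like(u)
    for i in range(len(u)):
        step = np.zeros_like(u)
        step[i] = eps
        upper = np.log(gibbs_policy(u + step, phi)[a])
        lower = np.log(gibbs_policy(u - step, phi)[a])
        grad[i] = (upper - lower) / (2.0 * eps)
    return grad
```

A further consequence worth noting is that $\sum_a \pi(a|s)\psi(s, a) = 0$, which is the property that allows a state-dependent baseline to be subtracted without bias.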
3. Convergence Analysis

Our algorithm has the same recursive stochastic form as the off-policy value-function algorithms

$$u_{t+1} = u_t + \alpha_t \left( h(u_t, v_t) + M_{t+1} \right)$$

where $h : \mathbb{R}^N \to \mathbb{R}^N$ is a differentiable function and $\{M_t\}_{t \geq 0}$ is a noise sequence. Following previous off-policy gradient proofs (Maei, 2011), we study the behavior of the ordinary differential equation

$$\dot{u}(t) = h(u(t), v)$$

The two updates (for the actor and for the critic) are 
not independent on each time step; we analyze two 
separate ODEs using a two timescale analysis (Borkar, 
2008). The actor update is analyzed given fixed critic 
parameters, and vice versa, iteratively (until conver- 
gence). We make the following assumptions. 

(A1) The policy viewed as a function of $u$, $\pi_{(\cdot)}(a|s) : \mathbb{R}^{N_u} \to (0, 1]$, is continuously differentiable, $\forall s \in \mathcal{S}, a \in \mathcal{A}$.

(A2) The update on $u_t$ includes a projection operator, $\Gamma : \mathbb{R}^{N_u} \to \mathbb{R}^{N_u}$, that projects any $u$ to a compact set $\mathcal{U} = \{u \mid q_i(u) \leq 0, i = 1, \ldots, s\} \subset \mathbb{R}^{N_u}$, where $q_i(\cdot) : \mathbb{R}^{N_u} \to \mathbb{R}$ are continuously differentiable functions specifying the constraints of the compact region. For $u$ on the boundary of $\mathcal{U}$, the gradients of the active $q_i$ are linearly independent. Assume the compact region is large enough to contain at least one (local) maximum of $J_\gamma$.

(A3) The behavior policy has a minimum positive value $b_{\min} \in (0, 1]$: $b(a|s) \geq b_{\min}$ $\forall s \in \mathcal{S}, a \in \mathcal{A}$.

(A4) The sequence $(x_t, x_{t+1}, r_{t+1})_{t \geq 0}$ is i.i.d. and has uniformly bounded second moments.

(A5) For every $u \in \mathcal{U}$ (the compact region to which $u$ is projected), $V^{\pi_u,\gamma} : \mathcal{S} \to \mathbb{R}$ is bounded.

Remark 1: It is difficult to prove the boundedness of 
the iterates without the projection operator. Since we 
have a bounded function (with range (0, 1]), we could 
instead assume that the gradient goes to zero exponentially
as $u \to \infty$, ensuring boundedness. Previous
work, however, has illustrated that the stochasticity in 
practice makes convergence to an unstable equilibrium 
unlikely (Pemantle, 1990); therefore, we avoid restric- 
tions on the policy function and do not include the 
projection in our algorithm.

Finally, we have the following (standard) assumptions 
on features and step-sizes. 

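(P1) $\|x_t\|_\infty < \infty$, $\forall t$, where $x_t \in \mathbb{R}^{N_v}$.

(P2) Matrices $C = \mathrm{E}[x_t x_t^\top]$, $A = \mathrm{E}[x_t (x_t - \gamma x_{t+1})^\top]$ are non-singular and uniformly bounded. $A$, $C$ and $\mathrm{E}[r_{t+1} x_t]$ are well-defined because the distribution of $(x_t, x_{t+1}, r_{t+1})$ does not depend on $t$.

(S1) $\alpha_{v,t}, \alpha_{w,t}, \alpha_{u,t} > 0$, $\forall t$, are deterministic such that $\sum_t \alpha_{v,t} = \sum_t \alpha_{w,t} = \sum_t \alpha_{u,t} = \infty$ and $\sum_t \alpha_{v,t}^2 < \infty$, $\sum_t \alpha_{w,t}^2 < \infty$ and $\sum_t \alpha_{u,t}^2 < \infty$, with $\frac{\alpha_{u,t}}{\alpha_{v,t}} \to 0$.

(S2) Define $H(A) = (A + A^\top)/2$ and let $\lambda_{\min}(C^{-1} H(A))$ be the minimum eigenvalue of the matrix $C^{-1} H(A)$ (footnote 1). Then $\alpha_{w,t} = \eta \alpha_{v,t}$ for some $\eta > \max(0, -\lambda_{\min}(C^{-1} H(A)))$.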

Remark 2: The assumption $\alpha_{u,t}/\alpha_{v,t} \to 0$ in (S1) states that the actor step-sizes go to zero at a faster rate than the value function step-sizes: the actor update moves on a slower timescale than the critic update (which changes more from its larger step sizes). This timescale is desirable because we effectively want a converged value function estimate for the current policy weights, $u_t$. Examples of suitable step sizes are $\alpha_{v,t} = \frac{1}{t}$, $\alpha_{u,t} = \frac{1}{1 + t \log t}$ or $\alpha_{v,t} = \frac{1}{t^{2/3}}$, $\alpha_{u,t} = \frac{1}{t}$ (with $\alpha_{w,t} = \eta \alpha_{v,t}$ for $\eta$ satisfying (S2)).
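The timescale-separation condition in (S1) can be inspected numerically for candidate schedules. The sketch below reads the two example schedules in Remark 2 as $\alpha_{v,t} = 1/t$ with $\alpha_{u,t} = 1/(1 + t \log t)$, and $\alpha_{v,t} = t^{-2/3}$ with $\alpha_{u,t} = 1/t$; that reading, and the function names, are assumptions for illustration.

```python
import numpy as np

def schedule_a(t):
    """Critic step 1/t, actor step 1/(1 + t log t); returns (alpha_v, alpha_u)."""
    return 1.0 / t, 1.0 / (1.0 + t * np.log(t))

def schedule_b(t):
    """Critic step t^(-2/3), actor step 1/t; returns (alpha_v, alpha_u)."""
    return t ** (-2.0 / 3.0), 1.0 / t

def timescale_ratio(schedule, t):
    """alpha_u / alpha_v, which (S1) requires to tend to zero as t grows."""
    alpha_v, alpha_u = schedule(t)
    return alpha_u / alpha_v
```

Evaluating `timescale_ratio` on a grid of increasing `t` shows that the first schedule separates the timescales only logarithmically, while the second does so polynomially, which may matter in practice over finite runs.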

The above assumptions are actually quite unrestrictive. Most algorithms inherently assume bounded features with bounded value functions for all policies; unbounded values trivially result in unbounded value function weights. Common policy distributions are smooth, making $\pi(a|s)$ continuously differentiable in $u$. The least practical assumption is that the tuples $(x_t, x_{t+1}, r_{t+1})$ are i.i.d., in other words, Martingale noise instead of Markov noise. For Markov noise, our proof, as well as the proofs for GTD($\lambda$) and GQ($\lambda$), requires Borkar's (2008) two-timescale theory to be extended to Markov noise (which is outside the scope of this paper). Finally, the proof for Theorem 3 assumes $\lambda = 0$, but should extend to $\lambda > 0$ similarly to GTD($\lambda$) (see Maei, 2011, Section 7.4, for convergence remarks).

We give a proof sketch of the following convergence 
theorem, with the full proof in the appendix. 

Theorem 3 (Convergence of Off-PAC). Let $\lambda = 0$ and consider the Off-PAC iterations with GTD(0) (footnote 2) for the critic. Assume that (A1)-(A5), (P1)-(P2) and (S1)-(S2) hold. Then the policy weights, $u_t$, converge to $\tilde{Z} = \{u \in \mathcal{U} \mid g(u) = 0\}$ and the value function weights, $v_t$, converge to the corresponding TD-solution with probability one.

Proof Sketch: We follow a similar outline to the two-timescale analysis for on-policy policy gradient actor-critic (Bhatnagar et al., 2009) and for nonlinear GTD (Maei et al., 2009). We analyze the dynamics for our two weights, $u_t$ and $z_t^\top = (w_t^\top \; v_t^\top)$, based on our update rules. The proof involves satisfying seven requirements from Borkar (2008, p. 64) to ensure convergence to an asymptotically stable equilibrium. ■

Footnote 1: The minimum exists because all eigenvalues are real-valued (see the corresponding lemma).
Footnote 2: GTD(0) is GTD($\lambda$) with $\lambda = 0$, not the different algorithm called GTD(0) by Sutton, Szepesvari & Maei (2008).

4. Empirical Results 

This section compares the performance of Off-PAC to three other off-policy algorithms with linear memory and computational complexity: 1) Q($\lambda$) (called Q-Learning when $\lambda = 0$), 2) Greedy-GQ (GQ($\lambda$) with a greedy target policy), and 3) Softmax-GQ (GQ($\lambda$) with a Softmax target policy). The policy in Off-PAC is a Gibbs distribution as defined in Section 2.3.

We used three benchmarks: mountain car, a pendulum 
problem and a continuous grid world. These prob- 
lems all have a discrete action space and a continu- 
ous state space, for which we use function approxima- 
tion. The behavior policy is a uniform distribution 
over all the possible actions in the problem for each 
time step. Note that Q($\lambda$) may not be stable in this
setting (Baird, 1995), unlike all the other algorithms. 

The goal of the mountain car problem (see Sutton & 
Barto, 1998) is to drive an underpowered car to the 
top of a hill. The state of the system is composed of 
the current position of the car (in [-1.2, 0.6]) and its
velocity (in [-.07, .07]). The car was initialized with
a position of -0.5 and a velocity of 0. Actions are a
throttle of {-1, 0, 1}. The reward at each time step
is -1. An episode ends when the car reaches the top
of the hill on the right or after 5,000 time steps. 

The second problem is a pendulum problem (Doya, 2000). The state of the system consists of the angle (in radians) and the angular velocity (in [-78.54, 78.54]) of the pendulum. Actions, the torque applied to the base, are {-2, 0, 2}. The reward is the cosine of the angle of the pendulum with respect to its fixed base. The pendulum is initialized with an angle and an angular velocity of 0 (i.e., stopped in a horizontal position). An episode ends after 5,000 time steps.

For the pendulum problem, it is unlikely that the be- 
havior policy will explore the optimal region where the 
pendulum is maintained in a vertical position. Conse- 
quently, this experiment illustrates which algorithms 
make best use of limited behavior samples. 

The last problem is a continuous grid world. The state is a 2-dimensional position in $[0, 1]^2$. The actions are the pairs {(0.0, 0.0), (-.05, 0.0), (.05, 0.0), (0.0, -.05), (0.0, .05)}, representing moves in both dimensions. Uniform noise in [-.025, .025] is added to each action component. The reward at each time step for arriving in a position $(p_x, p_y)$ is defined as:

$$-1 + -2\left( \mathcal{N}(p_x, .3, .1) \cdot \mathcal{N}(p_y, .6, .03) + \mathcal{N}(p_x, .4, .03) \cdot \mathcal{N}(p_y, .5, .1) + \mathcal{N}(p_x, .8, .03) \cdot \mathcal{N}(p_y, .9, .1) \right)$$

where $\mathcal{N}(p, \mu, \sigma) = e^{-\frac{(p - \mu)^2}{2\sigma^2}} / \left(\sigma \sqrt{2\pi}\right)$. The start position is (0.2, 0.4) and the goal position is (1.0, 1.0). An episode ends when the goal is reached, that is when the distance from the current position to the goal is less than 0.1 (using the L1-norm), or after 5,000 time steps. Figure 1 shows a representation of the problem.

Figure 1. Example of one trajectory for each algorithm in the continuous 2D grid world environment after 5,000 learning episodes from the behavior policy. Off-PAC is the only algorithm that learned to reach the goal reliably.

The feature vectors $x_s$ were binary vectors constructed according to the standard tile-coding technique (Sutton & Barto, 1998). For all problems, we used ten tilings, each of roughly 10 x 10 over the joint space of the two state variables, then hashed to a vector of dimension $10^6$. An additional feature was added that was always 1. State-action features, $\phi_{s,a}$, were also $10^6 + 1$ dimensional vectors constructed by also hashing the actions. We used a constant $\gamma = 0.99$. All the weight vectors were initialized to 0. We performed a parameter sweep to select the following parameters: 1) the step size $\alpha_v$ for Q($\lambda$), 2) the step-sizes $\alpha_v$ and $\alpha_w$ for the two vectors in Greedy-GQ, 3) $\alpha_v$, $\alpha_w$ and the temperature $\tau$ of the target policy distribution for Softmax-GQ and 4) the step sizes $\alpha_v$, $\alpha_w$ and $\alpha_u$ for Off-PAC. For the step sizes, the sweep was done over the following values: $\{10^{-4}, 5 \cdot 10^{-4}, 10^{-3}, \ldots, .5, 1.\}$ divided by 10+1=11, that is the number of tilings plus 1. To compare TD methods to gradient-TD methods, we also used $\alpha_w = 0$. The temperature parameter, $\tau$, was chosen from {.01, .05, .1, .5, 1, 5, 10, 50, 100} and $\lambda$ from {0, .2, .4, .6, .8, .99}. We ran thirty runs with each setting of the parameters.

Figure 2. Performance of Off-PAC compared to the performance of Q($\lambda$), Greedy-GQ, and Softmax-GQ when learning off-policy from a random behavior policy. Final performance selected the parameters for the best performance for the last 10% of the run, whereas the overall performance was over all the runs. The plots on the top show the learning curve for the best parameters for the final performance. Off-PAC always had the best performance and was the only algorithm able to learn to reach the goal reliably in the continuous grid world. Performance is indicated with the standard error.

For each parameter combination, the learning algo- 
rithm updates a target policy online from the data 
generated by the behavior policy. For all the prob- 
lems, the target policy was evaluated at 20 points in 
time during the run by running it 5 times on another 
instance of the problem. The target policy was not up- 
dated during evaluation, ensuring that it was learned 
only with data from the behavior policy. 

Figure 2 shows results on three problems. Softmax-GQ
and Off-PAC improved their policy compared to
the behavior policy on all problems, while the improvements
for Q($\lambda$) and Greedy-GQ are limited on the continuous
grid world. Off-PAC performed best on all
problems. On the continuous grid world, Off-PAC was
the only algorithm able to learn a policy that reliably
found the goal after 5,000 episodes (see Figure 1). On
all problems, Off-PAC had the lowest standard error. 

5. Discussion 

Off-PAC, like other two-timescale update algorithms, 
can be sensitive to parameter choices, particularly the 
step-sizes. Off-PAC has four parameters: $\lambda$ and the three step sizes, $\alpha_v$ and $\alpha_w$ for the critic and $\alpha_u$ for the actor. In practice, the following procedure can be used to set these parameters. The value of $\lambda$, as with other algorithms, will depend on the problem and it is often better to start with low values (less than .4). A common heuristic is to set $\alpha_v$ to 0.1 divided by the norm of the feature vector, $x_s$, while keeping the value of $\alpha_w$ low. Once GTD($\lambda$) is stable learning the value function with $\alpha_u = 0$, $\alpha_u$ can be increased so that the policy of the actor can be improved. This is consistent with the requirements in the proof, where the step-sizes should be chosen so that the slow update (the actor) is not changing as quickly as the fast inner update to the value function weights (the critic).

As mentioned by Borkar (2008, p. 75), another scheme 
that works well in practice is to use the restrictions 
on the step-sizes in the proof and to also subsample 
updates for the slow update. Subsampling updates 
means only updating every {tN, t > 0}, for some N > 
1: the actor is fixed in-between tN and (t + 1)N while 
the critic is being updated. This further slows the 
actor update and enables an improved value function 
estimate for the current policy, $\pi$.

In this work, we did not explore incremental natural 
actor-critic methods (Bhatnagar et al., 2009), which
use the natural gradient as opposed to the conventional 
gradient. The extension to off-policy natural actor- 
critic should be straightforward, involving only a small 
modification to the update and analysis of this new 
dynamical system (which will have similar properties 
to the original update). 

Finally, as pointed out by Precup et al. (2006), off- 
policy updates can be more noisy compared to on- 
policy learning. The results in this paper suggest that 
Off-PAC is more robust to such noise because it has 
lower variance than the action-value based methods. 
Consequently we think Off-PAC is a promising direc- 
tion for extending off-policy learning to a more general 
setting such as continuous action spaces. 

6. Conclusion 

This paper proposed a new algorithm for learning 
control off-policy, called Off-PAC (Off-Policy Actor- 
Critic). We proved that Off-PAC converges in a stan- 
dard off-policy setting. We provided one of the first 
empirical evaluations of off-policy control with the new 
gradient-TD methods and showed that Off-PAC has 
the best final performance on three benchmark prob- 
lems and consistently has the lowest standard error. 
Overall, Off-PAC is a significant step toward robust 
off-policy control. 
A. Appendix of Off-Policy Actor-Critic

A.1. Policy Improvement and Policy Gradient Theorems

Theorem 1 [Off-Policy Policy Improvement Theorem]
Given any policy parameter $u$, let

$$u' = u + \alpha g(u)$$

Then there exists an $\epsilon > 0$ such that, for all positive $\alpha < \epsilon$,

$$J_\gamma(u') \geq J_\gamma(u)$$

Further, if $\pi$ has a tabular representation (i.e., separate weights for each state), then $V^{\pi_{u'},\gamma}(s) \geq V^{\pi_u,\gamma}(s)$ for all $s \in \mathcal{S}$.

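Proof. Notice first that for any $(s, a)$, the gradient $\nabla_u \pi(a|s)$ is the direction to increase the probability of action $a$ according to function $\pi(\cdot|s)$. For an appropriate step size $\alpha_{u,t}$ (so that the update to $\pi_{u'}$ increases the objective with the action-value function $Q^{\pi_u,\gamma}$ fixed as the old action-value function), we can guarantee that

$$\begin{aligned}
J_\gamma(u) &= \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \pi_u(a|s) Q^{\pi_u,\gamma}(s, a) \\
&\leq \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \pi_{u'}(a|s) Q^{\pi_u,\gamma}(s, a)
\end{aligned}$$

Now we can proceed similarly to the Policy Improvement theorem proof provided by Sutton and Barto (1998) by extending the right-hand side using the definition of $Q^{\pi,\gamma}(s, a)$ (Equation 2):

$$\begin{aligned}
J_\gamma(u) &\leq \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \pi_{u'}(a|s) \, \mathrm{E}\left[ r_{t+1} + \gamma_{t+1} V^{\pi_u,\gamma}(s_{t+1}) \mid \pi_{u'}, \gamma \right] \\
&\leq \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \pi_{u'}(a|s) \, \mathrm{E}\left[ r_{t+1} + \gamma_{t+1} r_{t+2} + \gamma_{t+2} V^{\pi_u,\gamma}(s_{t+2}) \mid \pi_{u'}, \gamma \right] \\
&\;\;\vdots \\
&\leq \sum_{s \in \mathcal{S}} d^b(s) \sum_{a \in \mathcal{A}} \pi_{u'}(a|s) Q^{\pi_{u'},\gamma}(s, a) \\
&= J_\gamma(u')
\end{aligned}$$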

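The second part of the theorem has a similar proof to the above. With a tabular representation for $\pi$, we know
that the gradient satisfies:

$$\sum_{a \in \mathcal{A}} \pi_u(a|s)\, Q^{\pi_u,\gamma}(s, a) \le \sum_{a \in \mathcal{A}} \pi_{u'}(a|s)\, Q^{\pi_u,\gamma}(s, a)$$

because the probabilities can be updated independently for each state with separate weights for each state.
Now for any $s \in \mathcal{S}$:

$$\begin{aligned}
V^{\pi_u,\gamma}(s) &= \sum_{a \in \mathcal{A}} \pi_u(a|s)\, Q^{\pi_u,\gamma}(s, a) \\
&\le \sum_{a \in \mathcal{A}} \pi_{u'}(a|s)\, Q^{\pi_u,\gamma}(s, a) \\
&\le \sum_{a \in \mathcal{A}} \pi_{u'}(a|s)\, \mathbb{E}\left[ r_{t+1} + \gamma_{t+1} V^{\pi_u,\gamma}(s_{t+1}) \mid \pi_{u'}, \gamma \right] \\
&\le \sum_{a \in \mathcal{A}} \pi_{u'}(a|s)\, \mathbb{E}\left[ r_{t+1} + \gamma_{t+1} r_{t+2} + \gamma_{t+2} V^{\pi_u,\gamma}(s_{t+2}) \mid \pi_{u'}, \gamma \right] \\
&\;\;\vdots \\
&\le \sum_{a \in \mathcal{A}} \pi_{u'}(a|s)\, Q^{\pi_{u'},\gamma}(s, a) \\
&= V^{\pi_{u'},\gamma}(s)
\end{aligned}$$

□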
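As a small numerical illustration of Theorem 1 (not part of the original analysis), the sketch below uses a hypothetical one-state problem with a softmax policy and $\gamma = 0$, so that $Q(s, a)$ is simply the expected reward of action $a$ and $g(u) = \sum_a \nabla_u \pi_u(a|s) Q(s, a)$. The action values `q`, the step size `alpha` and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(u):
    """Softmax policy pi_u(a) over action preferences u."""
    z = np.exp(u - u.max())
    return z / z.sum()

def objective(u, q):
    """J(u) = sum_a pi_u(a) q(a) for a single state with d_b(s) = 1."""
    return float(softmax(u) @ q)

def g(u, q):
    """g(u) = sum_a grad_u pi_u(a) q(a), with the action values q held fixed."""
    pi = softmax(u)
    # Jacobian of softmax: d pi_a / d u_k = pi_a * (1[a == k] - pi_k)
    jac = np.diag(pi) - np.outer(pi, pi)
    return jac.T @ q

q = np.array([1.0, 0.0, 0.5])   # hypothetical action values
u = np.zeros(3)                 # initial policy weights (uniform policy)
alpha = 0.1                     # small positive step size (assumed)
u_next = u + alpha * g(u, q)    # u' = u + alpha * g(u)
improved = objective(u_next, q) >= objective(u, q)
```

In this degenerate case the action values do not depend on $u$, so $g(u)$ coincides with the true gradient; the example only illustrates the direction of the inequality, not the general argument.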

Theorem 2 [Off-Policy Policy Gradient Theorem]

Let $Z = \{u \in \mathcal{U} \mid \nabla_u J_\gamma(u) = 0\}$ and $\tilde{Z} = \{u \in \mathcal{U} \mid g(u) = 0\}$, which are both non-empty by Assumption (A2).
If the value function can be represented by our function class, then

$$Z \subset \tilde{Z}$$

Moreover, if we use a tabular representation for $\pi$, then

$$Z = \tilde{Z}$$

Proof. This theorem follows from our policy improvement theorem.

Assume there exists $u^* \in Z$ such that $u^* \notin \tilde{Z}$. Then $\nabla_{u^*} J_\gamma(u) = 0$ but $g(u^*) \ne 0$. By the policy improvement
theorem (Theorem 1), we know that $J_\gamma(u^* + \alpha_{u,t}\, g(u^*)) > J_\gamma(u^*)$, for some positive $\alpha_{u,t}$. However, this is a
contradiction, as the true gradient is zero. Therefore, such a $u^*$ cannot exist.

For the second part of the theorem, we have a tabular representation, in other words, each weight corresponds
to exactly one state. Without loss of generality, assume each state $s$ is represented with $m \in \mathbb{N}$ weights, indexed
by $i_{s,1}, \ldots, i_{s,m}$ in the vector $u$. Therefore, for any state $s$,

$$\sum_{s' \in \mathcal{S}} d^b(s') \sum_{a \in \mathcal{A}} \frac{\partial}{\partial u_{i_{s,j}}} \pi_u(a|s')\, Q^{\pi_u,\gamma}(s', a) = d^b(s) \sum_{a \in \mathcal{A}} \frac{\partial}{\partial u_{i_{s,j}}} \pi_u(a|s)\, Q^{\pi_u,\gamma}(s, a) = g_1(u_{i_{s,j}})$$

Assume there exists $s \in \mathcal{S}$ such that $g_1(u_{i_{s,j}}) = 0\ \forall j$ but there exists $1 \le k \le m$ for
$g_2(u_{i_{s,k}}) = \sum_{s' \in \mathcal{S}} d^b(s') \sum_{a \in \mathcal{A}} \pi_u(a|s') \frac{\partial}{\partial u_{i_{s,k}}} Q^{\pi_u,\gamma}(s', a)$ such that $g_2(u_{i_{s,k}}) \ne 0$. The partial derivative $\frac{\partial}{\partial u_{i_{s,k}}} Q^{\pi_u,\gamma}(s', a)$ can only increase the
value of $Q^{\pi_u,\gamma}(s, a)$ locally (i.e., shift the probabilities of the actions to increase return), because it cannot
change the value in other states ($u_{i_{s,k}}$ is only used for state $s$ and the remaining weights are fixed when this
partial derivative is computed). Therefore, since $g_2(u_{i_{s,k}}) \ne 0$, we must be able to increase the value of state $s$
by changing the probabilities of the actions in state $s$:

$$\sum_{j=1}^{m} \sum_{a \in \mathcal{A}} \frac{\partial}{\partial u_{i_{s,j}}} \pi_u(a|s)\, Q^{\pi_u,\gamma}(s, a) \ne 0$$

which is a contradiction (since we assumed $g_1(u_{i_{s,j}}) = 0\ \forall j$).

Therefore, in the tabular case, whenever $\sum_s d^b(s) \sum_a \nabla_u \pi_u(a|s)\, Q^{\pi_u,\gamma}(s, a) = 0$, then
$\sum_s d^b(s) \sum_a \pi_u(a|s)\, \nabla_u Q^{\pi_u,\gamma}(s, a) = 0$, implying that $\tilde{Z} \subset Z$. Since we already know that $Z \subset \tilde{Z}$,
we can conclude that for a tabular representation for $\pi$, $Z = \tilde{Z}$.

□

A.2. Forward/Backward view analysis

In this section, we prove the key relationship between the forward and backward views:

$$\mathbb{E}_b\left[ \rho(s_t, a_t)\, \psi(s_t, a_t) \left( R_t^\lambda - \hat{V}(s_t) \right) \right] = \mathbb{E}_b\left[ \delta_t e_t \right] \tag{6}$$



where, in these expectations, and in all the expectations in this section, the random variables (indexed by time
step) are from their stationary distributions under the behavior policy. We assume that the behavior policy
is stationary and that the Markov chain is aperiodic and irreducible (i.e., that we have reached the limiting
distribution, $d^b$, over $s \in \mathcal{S}$). Note that under these definitions:

$$\mathbb{E}_b[X_t] = \mathbb{E}_b[X_{t+k}] \tag{7}$$

for all integer $k$ and for all random variables $X_t$ and $X_{t+k}$ that are simple temporal shifts of each other. To
simplify the notation in this section, we define $\rho_t = \rho(s_t, a_t)$, $\psi_t = \psi(s_t, a_t)$, $\gamma_t = \gamma(s_t)$, and $\delta_t^\lambda = R_t^\lambda - \hat{V}(s_t)$.
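
Proof. First we note that $\delta_t^\lambda$, which might be called the forward-view TD error, can be written recursively:

$$\begin{aligned}
\delta_t^\lambda &= R_t^\lambda - \hat{V}(s_t) \\
&= r_{t+1} + (1 - \lambda)\gamma_{t+1} \hat{V}(s_{t+1}) + \lambda \gamma_{t+1} \rho_{t+1} R_{t+1}^\lambda - \hat{V}(s_t) \\
&= r_{t+1} + \gamma_{t+1} \hat{V}(s_{t+1}) - \lambda \gamma_{t+1} \hat{V}(s_{t+1}) + \lambda \gamma_{t+1} \rho_{t+1} R_{t+1}^\lambda - \hat{V}(s_t) \\
&= r_{t+1} + \gamma_{t+1} \hat{V}(s_{t+1}) - \hat{V}(s_t) + \lambda \gamma_{t+1} \left( \rho_{t+1} R_{t+1}^\lambda - \hat{V}(s_{t+1}) \right) \\
&= \delta_t + \lambda \gamma_{t+1} \left[ \rho_{t+1} R_{t+1}^\lambda - \rho_{t+1} \hat{V}(s_{t+1}) - (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right] \\
&= \delta_t + \lambda \gamma_{t+1} \left( \rho_{t+1} \delta_{t+1}^\lambda - (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right)
\end{aligned} \tag{8}$$

where $\delta_t = r_{t+1} + \gamma_{t+1} \hat{V}(s_{t+1}) - \hat{V}(s_t)$ is the conventional one-step TD error.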
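Second, we note that the following expectation is zero:

$$\begin{aligned}
&\mathbb{E}_b\left[ \rho_t \psi_t \gamma_{t+1} (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right] \\
&\quad= \sum_s d^b(s) \sum_a b(a|s)\, \rho(s, a)\, \psi(s, a) \sum_{s'} P(s'|s, a)\, \gamma(s') \left( 1 - \sum_{a'} b(a'|s')\, \rho(s', a') \right) \hat{V}(s') \\
&\quad= \sum_s d^b(s) \sum_a b(a|s)\, \rho(s, a)\, \psi(s, a) \sum_{s'} P(s'|s, a)\, \gamma(s') \left( 1 - \sum_{a'} b(a'|s') \frac{\pi(a'|s')}{b(a'|s')} \right) \hat{V}(s') \\
&\quad= \sum_s d^b(s) \sum_a b(a|s)\, \rho(s, a)\, \psi(s, a) \sum_{s'} P(s'|s, a)\, \gamma(s') \left( 1 - \sum_{a'} \pi(a'|s') \right) \hat{V}(s') \\
&\quad= 0
\end{aligned} \tag{9}$$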
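The step that makes Equation 9 vanish is that the importance ratios average to one under the behavior policy, since $\sum_{a'} b(a'|s')\rho(s', a') = \sum_{a'} \pi(a'|s') = 1$. A minimal numerical check of that identity for a single state, with randomly drawn (hypothetical) policies, could look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
b = rng.dirichlet(np.ones(n_actions))    # behavior policy b(.|s'), all entries > 0
pi = rng.dirichlet(np.ones(n_actions))   # target policy pi(.|s')
rho = pi / b                             # importance ratios rho(s', a')

expected_rho = float(b @ rho)            # sum_a' b(a'|s') rho(s', a')
residual = 1.0 - expected_rho            # the factor multiplying V(s') in the expectation
```

The residual is zero up to floating-point error, which is why the term involving $(1 - \rho_{t+1})$ drops out of the expectation.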



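We are now ready to prove Equation 6 simply by repeated unrolling and rewriting of the left-hand side, using
Equations 8, 9 and 7 in sequence, until the pattern becomes clear:

$$\begin{aligned}
&\mathbb{E}_b\left[ \rho_t \psi_t \left( R_t^\lambda - \hat{V}(s_t) \right) \right] \\
&= \mathbb{E}_b\left[ \rho_t \psi_t \left( \delta_t + \lambda \gamma_{t+1} \left( \rho_{t+1} \delta_{t+1}^\lambda - (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right) \right) \right] && \text{(using (8))} \\
&= \mathbb{E}_b\left[ \rho_t \psi_t \delta_t \right] + \mathbb{E}_b\left[ \rho_t \psi_t \lambda \gamma_{t+1} \rho_{t+1} \delta_{t+1}^\lambda \right] - \mathbb{E}_b\left[ \rho_t \psi_t \lambda \gamma_{t+1} (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right] \\
&= \mathbb{E}_b\left[ \rho_t \psi_t \delta_t \right] + \mathbb{E}_b\left[ \rho_t \psi_t \lambda \gamma_{t+1} \rho_{t+1} \delta_{t+1}^\lambda \right] && \text{(using (9))} \\
&= \mathbb{E}_b\left[ \rho_t \psi_t \delta_t \right] + \mathbb{E}_b\left[ \rho_{t-1} \psi_{t-1} \lambda \gamma_t \rho_t \delta_t^\lambda \right] && \text{(using (7))} \\
&= \mathbb{E}_b\left[ \rho_t \psi_t \delta_t \right] + \mathbb{E}_b\left[ \rho_{t-1} \psi_{t-1} \lambda \gamma_t \rho_t \left( \delta_t + \lambda \gamma_{t+1} \left( \rho_{t+1} \delta_{t+1}^\lambda - (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right) \right) \right] && \text{(using (8))} \\
&= \mathbb{E}_b\left[ \rho_t \psi_t \delta_t \right] + \mathbb{E}_b\left[ \rho_{t-1} \psi_{t-1} \lambda \gamma_t \rho_t \delta_t \right] + \mathbb{E}_b\left[ \rho_{t-1} \psi_{t-1} \lambda \gamma_t \rho_t \lambda \gamma_{t+1} \rho_{t+1} \delta_{t+1}^\lambda \right] && \text{(using (9))} \\
&= \mathbb{E}_b\left[ \rho_t \delta_t \left( \psi_t + \lambda \gamma_t \rho_{t-1} \psi_{t-1} \right) \right] + \mathbb{E}_b\left[ \lambda^2 \rho_{t-2} \psi_{t-2} \gamma_{t-1} \rho_{t-1} \gamma_t \rho_t \delta_t^\lambda \right] && \text{(using (7))} \\
&= \mathbb{E}_b\left[ \rho_t \delta_t \left( \psi_t + \lambda \gamma_t \rho_{t-1} \psi_{t-1} \right) \right] + \mathbb{E}_b\left[ \lambda^2 \rho_{t-2} \psi_{t-2} \gamma_{t-1} \rho_{t-1} \gamma_t \rho_t \left( \delta_t + \lambda \gamma_{t+1} \left( \rho_{t+1} \delta_{t+1}^\lambda - (1 - \rho_{t+1}) \hat{V}(s_{t+1}) \right) \right) \right] && \text{(using (8))} \\
&= \mathbb{E}_b\left[ \rho_t \delta_t \left( \psi_t + \lambda \gamma_t \rho_{t-1} \psi_{t-1} \right) \right] + \mathbb{E}_b\left[ \lambda^2 \rho_{t-2} \psi_{t-2} \gamma_{t-1} \rho_{t-1} \gamma_t \rho_t \delta_t \right] + \mathbb{E}_b\left[ \lambda^2 \rho_{t-2} \psi_{t-2} \gamma_{t-1} \rho_{t-1} \gamma_t \rho_t \lambda \gamma_{t+1} \rho_{t+1} \delta_{t+1}^\lambda \right] && \text{(using (9))} \\
&= \mathbb{E}_b\left[ \rho_t \delta_t \left( \psi_t + \lambda \gamma_t \rho_{t-1} \left( \psi_{t-1} + \lambda \gamma_{t-1} \rho_{t-2} \psi_{t-2} \right) \right) \right] + \mathbb{E}_b\left[ \lambda^3 \rho_{t-3} \psi_{t-3} \gamma_{t-2} \rho_{t-2} \gamma_{t-1} \rho_{t-1} \gamma_t \rho_t \delta_t^\lambda \right] && \text{(using (7))} \\
&\;\;\vdots \\
&= \mathbb{E}_b\left[ \rho_t \delta_t \left( \psi_t + \lambda \gamma_t \rho_{t-1} \left( \psi_{t-1} + \lambda \gamma_{t-1} \rho_{t-2} \left( \psi_{t-2} + \lambda \gamma_{t-2} \rho_{t-3} \cdots \right) \right) \right) \right] \\
&= \mathbb{E}_b\left[ \delta_t e_t \right]
\end{aligned}$$

where $e_t = \rho_t (\psi_t + \lambda \gamma_t e_{t-1})$.

□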
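The backward-view recursion $e_t = \rho_t(\psi_t + \lambda\gamma_t e_{t-1})$ is what makes the update incremental. The following sketch, with randomly generated (hypothetical) values for $\rho$, $\gamma$ and $\psi$ over three steps, checks that iterating the recursion from a zero trace reproduces the nested expression in the last lines of the derivation, truncated at $t-2$. The function name `update_trace` is an assumption, not taken from the paper.

```python
import numpy as np

def update_trace(e_prev, rho, psi, lam, gamma):
    """Backward-view trace: e_t = rho_t * (psi_t + lam * gamma_t * e_{t-1})."""
    return rho * (psi + lam * gamma * e_prev)

rng = np.random.default_rng(1)
lam = 0.7
rhos = rng.uniform(0.5, 1.5, size=3)      # rho_{t-2}, rho_{t-1}, rho_t
gammas = rng.uniform(0.8, 1.0, size=3)    # gamma_{t-2}, gamma_{t-1}, gamma_t
psis = rng.normal(size=(3, 2))            # psi_{t-2}, psi_{t-1}, psi_t

e = np.zeros(2)                           # trace before step t-2
for rho, gamma, psi in zip(rhos, gammas, psis):
    e = update_trace(e, rho, psi, lam, gamma)

# Nested form of the trace, written out explicitly for three steps.
nested = rhos[2] * (psis[2] + lam * gammas[2] * rhos[1] *
                    (psis[1] + lam * gammas[1] * rhos[0] * psis[0]))
```

Each step costs time and memory linear in the length of the trace vector, regardless of how far back the nested sum extends.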






A.3. Convergence Proofs

Our algorithm has the same recursive stochastic form that the two-timescale off-policy value-function algorithms
have:

$$\begin{aligned}
u_{t+1} &= u_t + \alpha_t \left( h(u_t, z_t) + M_{t+1} \right) \\
z_{t+1} &= z_t + \alpha_t \left( f(u_t, z_t) + N_{t+1} \right)
\end{aligned}$$

where $x \in \mathbb{R}^d$, $h : \mathbb{R}^d \to \mathbb{R}^d$ is a differentiable function, $\{\alpha_t\}_{t \ge 0}$ is a positive step-size sequence and $\{M_t\}_{t \ge 0}$
is a noise sequence. Again, following the GTD($\lambda$) and GQ($\lambda$) proofs, we study the behavior of the ordinary
differential equation

$$\dot{u}(t) = h(u(t), z)$$

Since we have two updates, one for the actor and one for the critic, and those time updates are not linearly 
separable, we have to do a two timescale analysis (Borkar, 2008). In order to satisfy the conditions for the 
two-timescale analysis, we will need the following assumptions on our objective, the features and the step-sizes. 
Note that it is difficult to prove the boundedness of the iterates without the projection operator we describe 
below, though the projection was not necessary during experiments. 

(A1) The policy function, $\pi_{(\cdot)}(a|s) : \mathbb{R}^{N_u} \to [0, 1]$, is continuously differentiable in $u$, $\forall s \in \mathcal{S}, a \in \mathcal{A}$.

(A2) The update on $u_t$ includes a projection operator, $\Gamma : \mathbb{R}^{N_u} \to \mathbb{R}^{N_u}$, that projects any $u$ to a compact set
$\mathcal{U} = \{u \mid q_i(u) \le 0, i = 1, \ldots, s\} \subset \mathbb{R}^{N_u}$, where $q_i(\cdot) : \mathbb{R}^{N_u} \to \mathbb{R}$ are continuously differentiable functions
specifying the constraints of the compact region. For each $u$ on the boundary of $\mathcal{U}$, the gradients of the
active $q_i$ are considered to be linearly independent. Assume that the compact region, $\mathcal{U}$, is large enough to
contain at least one local maximum of $J_\gamma$.

(A3) The behavior policy has a minimum positive weight for all actions in every state, in other words, $b(a|s) \ge b_{\min}$
$\forall s \in \mathcal{S}, a \in \mathcal{A}$, for some $b_{\min} \in (0, 1]$.

(A4) The sequence $(x_t, x_{t+1}, r_{t+1})_{t \ge 0}$ is i.i.d. and has uniformly bounded second moments.

(A5) For every $u \in \mathcal{U}$ (the compact region to which $u$ is projected), $V^{\pi_u,\gamma} : \mathcal{S} \to \mathbb{R}$ is bounded.

(P1) $\|x_t\|_\infty < \infty$, $\forall t$, where $x_t \in \mathbb{R}^{N_v}$.

(P2) The matrices $C = \mathbb{E}[x_t x_t^\top]$ and $A = \mathbb{E}[x_t (x_t - \gamma x_{t+1})^\top]$ are non-singular and uniformly bounded. $A$, $C$
and $\mathbb{E}[r_{t+1} x_t]$ are well-defined because the distribution of $(x_t, x_{t+1}, r_{t+1})$ does not depend on $t$.

(S1) $\alpha_{v,t}, \alpha_{w,t}, \alpha_{u,t} > 0$, $\forall t$ are deterministic such that $\sum_t \alpha_{v,t} = \sum_t \alpha_{w,t} = \sum_t \alpha_{u,t} = \infty$ and $\sum_t \alpha_{v,t}^2 < \infty$,
$\sum_t \alpha_{w,t}^2 < \infty$ and $\sum_t \alpha_{u,t}^2 < \infty$ with $\frac{\alpha_{u,t}}{\alpha_{v,t}} \to 0$.

(S2) Define $H(A) = (A + A^\top)/2$ and let $\lambda_{\min}(C^{-1} H(A))$ be the minimum eigenvalue of the matrix $C^{-1} H(A)$.
Then $\alpha_{w,t} = \eta\, \alpha_{v,t}$ for some $\eta > \max(0, -\lambda_{\min}(C^{-1} H(A)))$.
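To make the step-size conditions (S1)-(S2) concrete, the sketch below uses hypothetical polynomial schedules, $\alpha_{v,t} = (t+1)^{-0.6}$, $\alpha_{w,t} = \eta\,\alpha_{v,t}$ and $\alpha_{u,t} = (t+1)^{-1}$. Exponents in $(0.5, 1]$ give divergent sums with convergent sums of squares, and the actor-to-critic ratio $(t+1)^{-0.4}$ tends to zero. The exponents and the default `eta=1.0` are assumptions for illustration; (S2) additionally requires $\eta$ to exceed a problem-dependent bound that this sketch does not compute.

```python
def step_sizes(t, eta=1.0, critic_exp=0.6, actor_exp=1.0):
    """Return (alpha_v, alpha_w, alpha_u) at time step t >= 0.

    Hypothetical polynomial schedules: exponents in (0.5, 1] give
    divergent sums and convergent sums of squares.
    """
    alpha_v = (t + 1) ** -critic_exp      # fast (critic) timescale
    alpha_w = eta * alpha_v               # second critic weights, tied to alpha_v
    alpha_u = (t + 1) ** -actor_exp       # slow (actor) timescale
    return alpha_v, alpha_w, alpha_u

# The ratio alpha_u / alpha_v should shrink toward zero as t grows.
ratios = [step_sizes(t)[2] / step_sizes(t)[0] for t in (0, 10, 1000, 100000)]
```

Because the ratio decays, the actor sees an almost converged critic, which is the separation the two-timescale analysis relies on.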

Theorem 3 (Convergence of Off-PAC) Let $\lambda = 0$ and consider the Off-PAC iterations for the critic (GTD($\lambda$),
i.e., TDC with importance sampling correction) and the actor (for weights $u_t$). Assume that (A1)-(A5), (P1)-(P2)
and (S1)-(S2) hold. Then the policy weights, $u_t$, converge to $\tilde{Z} = \{u \in \mathcal{U} \mid g(u) = 0\}$ and the value function
weights, $v_t$, converge to the corresponding TD-solution with probability one.

Proof. We follow a similar outline to that of the two-timescale analysis proof for TDC (Sutton et al., 2009). We
will analyze the dynamics for our two weights, $u_t$ and $z_t^\top = (w_t^\top\ v_t^\top)$, based on our update rules. We will take
$u_t$ as the slow timescale update and $z_t$ as the fast inner update.
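
First, we need to rewrite our updates for $v$, $w$ and $u$ in a form amenable to a two-timescale analysis:

$$\begin{aligned}
v_{t+1} &= v_t + \alpha_{v,t}\, \rho_t \left[ \delta_t x_t - \gamma x_{t+1} x_t^\top w_t \right] \\
w_{t+1} &= w_t + \alpha_{v,t}\, \eta \left[ \rho_t \delta_t x_t - x_t x_t^\top w_t \right]
\end{aligned}$$

$$z_{t+1} = z_t + \alpha_{v,t} \left[ G_{u_t,t+1} z_t + q_{u_t,t+1} \right] \tag{10}$$

$$u_{t+1} = \Gamma\left( u_t + \alpha_{u,t}\, \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \right) \tag{11}$$

where $\rho_t = \rho(s_t, a_t)$, $\delta_t = r_{t+1} + \gamma(s_{t+1}) \hat{V}(s_{t+1}) - \hat{V}(s_t)$, $\eta = \alpha_{w,t}/\alpha_{v,t}$, $q_{u_t,t+1}^\top = (\eta \rho_t r_{t+1} x_t^\top,\ \rho_t r_{t+1} x_t^\top)$, and

$$G_{u_t,t+1} = \begin{pmatrix} -\eta\, x_t x_t^\top & \eta\, \rho_t(u_t)\, x_t (\gamma x_{t+1} - x_t)^\top \\ -\gamma\, \rho_t(u_t)\, x_{t+1} x_t^\top & \rho_t(u_t)\, x_t (\gamma x_{t+1} - x_t)^\top \end{pmatrix}$$

Note that $G_u = \mathbb{E}[G_{u,t} \mid u]$ and $q_u = \mathbb{E}[q_{u,t} \mid u]$ are well defined because we assumed that the process
$(x_t, x_{t+1}, r_{t+1})_{t \ge 0}$ is i.i.d., $0 < \rho_t \le b_{\min}^{-1}$ and we have fixed $u_t$. Now we can define $h$ and $f$:

$$\begin{aligned}
h(z_t, u_t) &= G_{u_t} z_t + q_{u_t} \\
f(z_t, u_t) &= \mathbb{E}\left[ \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \,\Big|\, z_t, u_t \right] \\
M_{t+1} &= \left( G_{u_t,t+1} - G_{u_t} \right) z_t + q_{u_t,t+1} - q_{u_t} \\
N_{t+1} &= \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} - f(z_t, u_t)
\end{aligned}$$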
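For readers who prefer code, one step of the $\lambda = 0$ critic and actor updates analyzed here (TDC-style critic weights $v$, $w$ and actor weights $u$ moved along $\delta_t \nabla_u \pi(a_t|s_t) / b(a_t|s_t)$) might be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the linear softmax policy, the per-action policy features `phi`, the constant `gamma`, and the function names are all hypothetical, and the projection $\Gamma$ is omitted.

```python
import numpy as np

def softmax_policy(u, phi):
    """pi(.|s) for a linear softmax policy; phi holds one feature row per action."""
    prefs = phi @ u
    p = np.exp(prefs - prefs.max())
    return p / p.sum()

def offpac_step(v, w, u, x, x_next, r, gamma, a, b_prob, phi,
                alpha_v, alpha_w, alpha_u):
    """One lambda = 0 update of critic (v, w) and actor (u); no projection."""
    pi = softmax_policy(u, phi)
    rho = pi[a] / b_prob                          # importance ratio pi(a|s) / b(a|s)
    delta = r + gamma * (v @ x_next) - v @ x      # one-step TD error
    # Critic: gradient-TD (TDC) updates with importance sampling correction.
    v_new = v + alpha_v * rho * (delta * x - gamma * (x @ w) * x_next)
    w_new = w + alpha_w * (rho * delta * x - (x @ w) * x)
    # Actor: delta * grad_u pi(a|s) / b(a|s), using the softmax gradient.
    grad_pi = pi[a] * (phi[a] - pi @ phi)
    u_new = u + alpha_u * delta * grad_pi / b_prob
    return v_new, w_new, u_new, delta, rho

# Tiny usage example with made-up numbers (two state features, two actions).
x, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
phi = np.eye(2)
v1, w1, u1, delta, rho = offpac_step(
    np.zeros(2), np.zeros(2), np.zeros(2), x, x_next, r=1.0, gamma=0.9,
    a=0, b_prob=0.5, phi=phi, alpha_v=0.1, alpha_w=0.05, alpha_u=0.01)
```

Each update touches every weight once, so the per-step cost is linear in the number of learned weights.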

We have to satisfy the following conditions from Borkar (2008, p. 64):

(B1) $h : \mathbb{R}^{N_u + 2N_v} \to \mathbb{R}^{2N_v}$ and $f : \mathbb{R}^{N_u + 2N_v} \to \mathbb{R}^{N_u}$ are Lipschitz.

(B2) $\alpha_{v,t}, \alpha_{u,t}$, $\forall t$ are deterministic and $\sum_t \alpha_{v,t} = \sum_t \alpha_{u,t} = \infty$, $\sum_t \alpha_{v,t}^2 < \infty$, $\sum_t \alpha_{u,t}^2 < \infty$, and $\frac{\alpha_{u,t}}{\alpha_{v,t}} \to 0$ (i.e., the
system in Equation 11 moves on a slower timescale than Equation 10).

(B3) The sequences $\{M_t\}_{t \ge 0}$ and $\{N_t\}_{t \ge 0}$ are martingale difference sequences w.r.t. the increasing $\sigma$-fields,
$\mathcal{F}_t = \sigma(z_m, u_m, M_m, N_m, m \le t)$ (i.e., $\mathbb{E}[M_{t+1} \mid \mathcal{F}_t] = 0$).

(B4) For some constant $K > 0$, $\mathbb{E}[\|M_{t+1}\|^2 \mid \mathcal{F}_t] \le K(1 + \|z_t\|^2 + \|u_t\|^2)$ and $\mathbb{E}[\|N_{t+1}\|^2 \mid \mathcal{F}_t] \le K(1 + \|z_t\|^2 + \|u_t\|^2)$
hold for any $t \ge 0$.

(B5) The ODE $\dot{z}(t) = h(z(t), u)$ has a globally asymptotically stable equilibrium $\chi(u)$ where $\chi : \mathbb{R}^{N_u} \to \mathbb{R}^{N_v}$ is
a Lipschitz map.

(B6) The ODE $\dot{u}(t) = f(\chi(u(t)), u(t))$ has a globally asymptotically stable equilibrium, $u^*$.

(B7) $\sup_t (\|z_t\| + \|u_t\|) < \infty$, a.s.

An asymptotically stable equilibrium for a dynamical system is an attracting point for which small perturbations
still cause convergence back to that point. If we can verify these conditions, then we can use Theorem 2 by
Borkar (2008) that states that $(z_t, u_t) \to (\chi(u^*), u^*)$ a.s. Note that the previous actor-critic proofs transformed
the update to the negative update, assuming they were minimizing costs, $-R$, rather than maximizing and
so converging to a (local) minimum. This is unnecessary because we simply need to prove we have a stable
equilibrium, whether a maximum or minimum; therefore, we keep the update as in the algorithm and assume a
(local) maximum.

First note that because we have a bounded function, $\pi_{(\cdot)}(a|s) : \mathcal{U} \to (0, 1]$, we can more simply satisfy some of
the properties from Borkar (2008). Mainly, we know our policy function is Lipschitz (because it is bounded and
continuously differentiable), so we know the gradient is bounded, in other words, there exists $B_{\nabla u} \in \mathbb{R}$ such that
$\|\nabla_u \pi(a|s)\| \le B_{\nabla u}$.

For requirement (B1), $h$ is clearly Lipschitz because it is linear in $z$ and $\rho_t(u)$ is continuously differentiable
and bounded ($\rho_t(u) \le b_{\min}^{-1}$). $f$ is Lipschitz because it is linear in $v$ and $\nabla_u \pi(a|s)$ is bounded and continuously
differentiable (making $J_\gamma$ with a fixed $\hat{Q}^{\pi,\gamma}$ continuously differentiable with a bounded derivative).

Requirement (B2) is satisfied by our assumptions.

Requirement (B3) is satisfied by the construction of $M_t$ and $N_t$.

For requirement (B4), we can first notice that $M_t$ satisfies the requirement because $r_{t+1}$, $x_t$ and $x_{t+1}$ have
uniformly bounded second moments (which is the justification used in the TDC proof (Sutton et al., 2009)) and
because $0 < \rho_t \le b_{\min}^{-1}$:

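$$\begin{aligned}
\mathbb{E}\left[ \|M_{t+1}\|^2 \mid \mathcal{F}_t \right] &= \mathbb{E}\left[ \left\| (G_{u_t,t} - G_{u_t}) z_t + (q_{u_t,t} - q_{u_t}) \right\|^2 \mid \mathcal{F}_t \right] \\
&\le \mathbb{E}\left[ \left\| (G_{u_t,t} - G_{u_t}) z_t \right\|^2 + \left\| q_{u_t,t} - q_{u_t} \right\|^2 \mid \mathcal{F}_t \right] \\
&\le \mathbb{E}\left[ c_1 \|z_t\|^2 + c_2 \mid \mathcal{F}_t \right] \\
&\le K \left( \|z_t\|^2 + 1 \right) \le K \left( \|z_t\|^2 + \|u_t\|^2 + 1 \right)
\end{aligned}$$

where the second inequality is by the Cauchy-Schwarz inequality, $\|(G_{u_t,t} - G_{u_t}) z_t\|^2 \le c_1 \|z_t\|^2$ and $\|q_{u_t,t} - q_{u_t}\|^2 \le c_2$
(because $r_{t+1}$, $x_t$ and $x_{t+1}$ have uniformly bounded second moments), with $c_1, c_2 \in \mathbb{R}_+$. We then simply set
$K = \max(c_1, c_2)$.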

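For $N_t$, since the iterates are bounded as we show below for requirement (B7) (giving $\sup_t \|u_t\| < B_u$ and
$\sup_t \|z_t\| < B_z$ for some $B_u, B_z \in \mathbb{R}$), we see that

$$\begin{aligned}
\mathbb{E}\left[ \|N_{t+1}\|^2 \mid \mathcal{F}_t \right]
&\le \mathbb{E}\left[ \left\| \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \right\|^2 + \left\| \mathbb{E}\left[ \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \,\Big|\, z_t, u_t \right] \right\|^2 \,\Big|\, \mathcal{F}_t \right] \\
&\le \mathbb{E}\left[ \left\| \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \right\|^2 + \mathbb{E}\left[ \left\| \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \right\|^2 \,\Big|\, z_t, u_t \right] \,\Big|\, \mathcal{F}_t \right] \\
&\le 2\, \mathbb{E}\left[ \left\| \delta_t \frac{\nabla_{u_t} \pi_t(a_t|s_t)}{b(a_t|s_t)} \right\|^2 \,\Big|\, \mathcal{F}_t \right] \\
&\le \frac{2}{b_{\min}^2}\, \mathbb{E}\left[ |\delta_t|^2 \left\| \nabla_{u_t} \pi_t(a_t|s_t) \right\|^2 \mid \mathcal{F}_t \right] \\
&\le \frac{2}{b_{\min}^2}\, \mathbb{E}\left[ |\delta_t|^2 B_{\nabla u}^2 \mid \mathcal{F}_t \right] \\
&\le K \left( \|v_t\|^2 + 1 \right) \le K \left( \|z_t\|^2 + \|u_t\|^2 + 1 \right)
\end{aligned}$$

for some $K \in \mathbb{R}$ because $\mathbb{E}[|\delta_t|^2 \mid \mathcal{F}_t] \le c_1 (1 + \|v_t\|^2)$ for some $c_1 \in \mathbb{R}$ because $r_{t+1}$, $x_t$ and $x_{t+1}$ have uniformly
bounded second moments and since $\|\nabla_u \pi(a|s)\| \le B_{\nabla u}$ $\forall s \in \mathcal{S}, a \in \mathcal{A}$ (as stated above because $\pi(a|s)$ is
Lipschitz continuous).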

For requirement (B5), we know that every policy, $\pi$, has a corresponding bounded $V^{\pi,\gamma}$ (by assumption).
We need to show that for each $u$, there is a globally asymptotically stable equilibrium of the system, $h(z(t), u)$
(which has yet to be shown for weighted importance sampling TDC, i.e., GTD($\lambda = 0$)). To do so, we use
the Hartman-Grobman Theorem, which requires us to show that $G$ has all negative eigenvalues. For readability,
we show this in a separate lemma (Lemma 4 below). Using Lemma 4, we know that there exists a function
$\chi : \mathbb{R}^{N_u} \to \mathbb{R}^{N_v}$ such that $\chi(u) = (v_u^\top\ w_u^\top)^\top$, where $v_u$ is the unique TD-solution value-function weights for
policy $\pi$ and $w_u$ is the corresponding expectation estimate. This function, $\chi$, is continuously differentiable with
bounded gradient (by Lemma 5 below) and is therefore a Lipschitz map.

For requirement (B6), we need to prove that our update $f(\chi(\cdot), \cdot)$ has an asymptotically stable equilibrium.
This requirement can be relaxed to a local rather than global asymptotically stable equilibrium, because we
simply need convergence. Our objective function, $J_\gamma$, is not concave because our policy function, $\pi(a|s)$, may not
be concave in $u$. Instead, we need to prove that all (local) equilibria are asymptotically stable.

We define a vector field operator, $\hat{\Gamma} : C(\mathbb{R}^{N_u}) \to C(\mathbb{R}^{N_u})$, that projects any gradients leading outside the compact
region, $\mathcal{U}$, back into $\mathcal{U}$:

$$\hat{\Gamma}(g(y)) = \lim_{h \to 0} \frac{\Gamma(y + h\, g(y)) - y}{h}$$
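To give a feel for the projected vector field $\hat{\Gamma}(g(y)) = \lim_{h \to 0} (\Gamma(y + h\,g(y)) - y)/h$, the sketch below takes $\Gamma$ to be Euclidean projection onto a ball, which is one compact set of the form required by (A2), with the single constraint $q_1(u) = \|u\|^2 - R^2 \le 0$. The ball, its radius, the finite step `h` and the constant vector field are all assumptions made for illustration; the paper does not prescribe a particular projection.

```python
import numpy as np

def project_ball(u, radius=10.0):
    """Gamma: Euclidean projection onto U = {u : ||u||^2 - radius^2 <= 0}."""
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

def projected_direction(g, y, h=1e-6, radius=10.0):
    """Finite-difference approximation of (Gamma(y + h g(y)) - y) / h."""
    return (project_ball(y + h * g(y), radius) - y) / h

def g(y):
    """Hypothetical constant vector field pointing along the first axis."""
    return np.array([1.0, 0.0])

# In the interior the field is unchanged; on the boundary the outward
# component is removed so the trajectory cannot leave U.
inside = projected_direction(g, np.array([0.0, 0.0]))
boundary = projected_direction(g, np.array([10.0, 0.0]))
```

In the interior of $\mathcal{U}$ the operator leaves $g$ untouched, so the projection only matters for iterates that reach the boundary.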

By our forward-backward view analysis and from the same arguments following from Lemma 3 by Bhatnagar et
al. (2009), we know that the right-hand side of the ODE $\dot{u}(t) = f(\chi(u(t)), u(t))$ is $g(u)$. Given that we have satisfied requirements
(B1)-(B5) and given our step-size conditions, using standard arguments (cf. Lemma 6 in Bhatnagar et al., 2009), we
can deduce that $u_t$ converges almost surely to the set of asymptotically stable fixed points, $\tilde{Z}$, of $\dot{u} = \hat{\Gamma}(g(u))$.

For requirement (B7), we know that $u_t$ is bounded because it is always projected to $\mathcal{U}$. Since $u$ stays in $\mathcal{U}$, we
know that $v$ stays bounded (by assumption, otherwise $V^{\pi_u,\gamma}$ would not be bounded) and correspondingly $w(v)$
must stay bounded, by the same argument as by Sutton et al. (2009). Therefore, we have that $\sup_t \|u_t\| < B_u$
and that $\sup_t \|z_t\| < B_z$ for some $B_u, B_z \in \mathbb{R}$.

□

Lemma 4. Under assumptions (A1)-(A5), (P1)-(P2) and (S1)-(S2), for any fixed set of actor weights, $u \in \mathcal{U}$,
the GTD($\lambda = 0$) update for the critic weights, $v_t$, converges to the TD solution with probability one.

Proof. Recall that

$$G_{u,t+1} = \begin{pmatrix} -\eta\, x_t x_t^\top & \eta\, \rho_t(u)\, x_t (\gamma x_{t+1} - x_t)^\top \\ -\gamma\, \rho_t(u)\, x_{t+1} x_t^\top & \rho_t(u)\, x_t (\gamma x_{t+1} - x_t)^\top \end{pmatrix}$$

and $G_u = \mathbb{E}[G_{u,t}]$, meaning

$$G_u = \begin{pmatrix} -\eta\, C & -\eta\, A_\rho(u) \\ -F_\rho(u)^\top & -A_\rho(u) \end{pmatrix}$$

where $F_\rho(u) = \gamma\, \mathbb{E}[\rho_t(u)\, x_{t+1} x_t^\top]$, with $C_\rho(u) = A_\rho(u) - F_\rho(u)$. For the remainder of the proof, we will simply
write $A_\rho$ and $C_\rho$, because it is clear that we have a fixed $u \in \mathcal{U}$.

Because GTD($\lambda$) is solely for value function approximation, the feature vector, $x$, is only dependent on the state:

$$\begin{aligned}
\mathbb{E}\left[ \rho_t x_t x_t^\top \right] &= \sum_{s_t, a_t} d(s_t)\, b(a_t|s_t)\, \rho_t\, x(s_t)\, x_t^\top \\
&= \sum_{s_t, a_t} d(s_t)\, \pi(a_t|s_t)\, x(s_t)\, x_t^\top \\
&= \sum_{s_t} d(s_t)\, x(s_t)\, x_t^\top \left( \sum_{a_t} \pi(a_t|s_t) \right) \\
&= \sum_{s_t} d(s_t)\, x(s_t)\, x_t^\top = \mathbb{E}\left[ x_t x_t^\top \right]
\end{aligned}$$

because $\sum_a \pi(a_t|s_t) = 1$. A similar argument shows that $\mathbb{E}[\rho_t x_{t+1} x_t^\top] = \mathbb{E}[x_{t+1} x_t^\top]$. Therefore, we get that
$F_\rho(u) = \gamma\, \mathbb{E}[x_{t+1} x_t^\top]$ and $A_\rho(u) = \mathbb{E}[x_t (\gamma x_{t+1} - x_t)^\top]$. The expected value of the update, $G$, therefore, is the
same as for TDC, which has been shown to converge under our assumptions (see Maei, 2011).

□

Lemma 5. Under assumptions (A1)-(A5), (P1)-(P2) and (S1)-(S2), let $\chi : \mathcal{U} \to \mathcal{V}$ be the map from policy
weights to corresponding value function, $V^{\pi,\gamma}$, obtained from using GTD($\lambda = 0$) (proven to exist by Lemma 4).
Then $\chi$ is continuously differentiable with a bounded gradient for all $u \in \mathcal{U}$.

Proof. To show that $\chi$ is continuous, we use the Weierstrass definition ($\delta$-$\epsilon$ definition). Because $\chi(u) =
-G(u)^{-1} q(u) = z_u$, which is a complicated function of $u$, we can luckily break it up and prove continuity about
parts of it. Recall that 1) the inverse of a continuous function is continuous at every point that represents a
non-singular matrix and 2) the multiplication of two continuous functions is continuous. Since $G(u)$ is always
nonsingular, we simply need to prove that $u \to G(u)$ and $u \to q(u)$ are continuous. $G(u)$ is composed of
several block matrices, including $C$, $F_\rho(u)$ and $A_\rho(u)$. We will start by showing that $u \to F_\rho(u)$ is continuous,
where $F_\rho(u) = \gamma\, \mathbb{E}[\rho_t(u)\, x_{t+1} x_t^\top \mid b]$. The remaining entries are similar.

Take any $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $u_1 \in \mathcal{U}$. We know that $\pi(a|s) : \mathcal{U} \to [0, 1]$ is continuous for all $u \in \mathcal{U}$ (by assumption).
Let $\epsilon_1 = \frac{\epsilon}{\gamma |\mathcal{A}| \left\| \mathbb{E}[x_{t+1} x_t^\top \mid b] \right\|}$ (well-defined because $\mathbb{E}[x_{t+1} x_t^\top \mid b]$ is nonsingular). Then we know there exists a $\delta > 0$
such that for any $u_2 \in \mathcal{U}$ with $\|u_1 - u_2\| < \delta$, we have $\|\pi_{u_1}(a_t|s_t) - \pi_{u_2}(a_t|s_t)\| < \epsilon_1$. Now

$$\begin{aligned}
\left\| F_\rho(u_1) - F_\rho(u_2) \right\| &= \gamma \left\| \mathbb{E}\left[ \rho_t(u_1)\, x_{t+1} x_t^\top \right] - \mathbb{E}\left[ \rho_t(u_2)\, x_{t+1} x_t^\top \right] \right\| \\
&= \gamma \left\| \sum_{s_t, a_t} d^b(s_t)\, b(a_t|s_t) \frac{\pi_{u_1}(a_t|s_t)}{b(a_t|s_t)} x_{t+1} x_t^\top - \sum_{s_t, a_t} d^b(s_t)\, b(a_t|s_t) \frac{\pi_{u_2}(a_t|s_t)}{b(a_t|s_t)} x_{t+1} x_t^\top \right\| \\
&= \gamma \left\| \sum_{s_t, a_t} d^b(s_t) \left[ \pi_{u_1}(a_t|s_t) - \pi_{u_2}(a_t|s_t) \right] x_{t+1} x_t^\top \right\| \\
&\le \gamma \sum_{s_t, a_t} d^b(s_t) \left\| \pi_{u_1}(a_t|s_t) - \pi_{u_2}(a_t|s_t) \right\| \left\| x_{t+1} x_t^\top \right\| \\
&< \gamma\, \epsilon_1 \sum_{s_t, a_t} d^b(s_t) \left\| x_{t+1} x_t^\top \right\| \\
&= \gamma\, \epsilon_1 |\mathcal{A}| \left\| \mathbb{E}\left[ x_{t+1} x_t^\top \mid b \right] \right\| = \epsilon
\end{aligned}$$

Therefore, $u \to F_\rho(u)$ is continuous. This same process can be done for $A_\rho(u)$ and $\mathbb{E}[\rho_t(u)\, r_{t+1} x_t \mid b]$ in $q(u)$.

Since $u \to G$ and $u \to q$ are continuous for all $u$, we know that $\chi(u) = -G(u)^{-1} q(u)$ is continuous.

The above can also be accomplished to show that $\nabla_u \chi$ is continuous, simply by replacing $\pi$ with $\nabla_u \pi$ above.
Finally, because our policy function is Lipschitz (because it is bounded and continuously differentiable), we know
that it has a bounded gradient. As a result, the gradient of $\chi$ is bounded (since we have nonsingular and bounded
expectation matrices), which would again follow from a similar analysis as above. □



